Method and system for measuring a system's transmission quality

ABSTRACT

Method and system for measuring transmission quality of an audio transmission system under test. Specifically, an input signal (X), such as an original input speech signal, is applied to the audio transmission system which results in an output signal (Y) produced by the transmission system. Both signals X and Y are mutually processed to yield a perceived quality signal. In accordance with the invention, output signal Y and/or input signal X are scaled such that, depending on a ratio of power of these two signals, relatively small deviations of power between these signals are compensated, while relatively larger deviations are only partially compensated. Further, an artificial reference speech signal may be created for which noise levels present in the input speech signal are reduced by a scale factor which reflects a local level of the noise in that input signal.

FIELD OF THE INVENTION

The invention relates to a method and a system for measuring the transmission quality of a system under test, in which an input signal entered into the system under test and an output signal resulting from the system under test are processed and mutually compared.

BACKGROUND OF THE INVENTION

Draft ITU-T recommendation P.862, “Telephone transmission quality, telephone installations, local line networks - Methods for objective and subjective assessment of quality - Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs”, ITU-T 02.2001 [see reference 8], discloses prior art PESQ methods and systems.

Measuring the quality of audio signals, degraded in audio processing or transmission systems, may give poor results for very weak or silent portions in the input signal. The methods and systems known from Recommendation P.862 have the disadvantage that they do not correctly compensate for differences in power level on a frame by frame basis. These differences are caused by gain variations or noise in the input signal. The incorrect compensation leads to low correlations between subjective and objective scores, especially when the original reference input speech signal contains low levels of noise.

According to a prior art method and system, disclosed in applicant's EP01200945, improvements are achieved by applying a first scaling step in a pre-processing stage with a first scaling factor which is a function of the reciprocal value of the power of the output signal increased by an adjustment value. A second scaling step is applied with a second scaling factor which is substantially equal to the first scaling factor raised to an exponent having an adjustment value between zero and one. The second scaling step may be carried out at various locations in the device, while the adjustment values are adjusted using test signals with well defined subjective quality scores.

Both in the methods and systems of Recommendation P.862 and in those of EP01200945, the degraded output signal is scaled locally to match the reference input signal in the power domain.

It has been found that the results of the (perceptual) quality measurement process can be improved by applying “soft-scaling” in at least one stage of the method and system respectively.

Introducing “soft-scaling” instead of “hard scaling” (using “hard” scaling thresholds) is based on the observation that human audio perception mechanisms use “soft thresholds” rather than “hard thresholds”; this matters because the field of the invention is the assessment of audio quality as experienced by human users. Based on that observation and a better understanding of how those human audio scaling mechanisms work, the present invention presents such “soft-scaling” mechanisms, to be added to or inserted into the prior art method or system respectively.

SUMMARY OF THE INVENTION

According to an aspect of the invention, the output signal and/or the input signal of a system are scaled in such a way that small deviations of the power are compensated, while larger deviations are compensated partially in a manner that is dependent on the power ratio.

According to a further elaboration of the invention, an artificial reference speech signal may be created, for which the noise levels as present in the original input speech signal are lowered by a scaling factor that depends on the local level of the noise in this input.

The result of the inventive measures is a more correct prediction of the subjectively perceived end-to-end speech quality for speech signals which contain variations in the local scaling, especially in the case where soft speech parts and silences are degraded by low levels of noise.

In the soft-scaling algorithm, two different types of signal processing are used to improve the correlation between subjectively perceived quality and objectively measured quality.

In the first soft-scale processing, controlled by a first sub-algorithm, the compensation used in Recommendation P.862 to correct for local gain changes in the output signal is improved by scaling the output (or the input) in such a way that small deviations of the power are compensated (preferably per time frame or period), while larger deviations are compensated partially, dependent on the power ratio.

A preferred simple and effective implementation takes the local powers, i.e., the power in each frame (of, e.g., 30 ms), and calculates a local compensation ratio F:

$F = \frac{PX + \Delta}{PY + \Delta}$

where “Δ” is used to optimize the value of C for small values of PY. F is amplitude clipped at levels mm and MM to get a clipped ratio C:

$C = mm$ whenever $F < mm \leq 1.0$,

$C = MM$ whenever $F > MM \geq 1.0$,

$C = F$ otherwise.

The clipped ratio C is then used to calculate a soft-scale ratio S by using factors m and M, with mm<m≦1.0 and MM>M≧1.0:

$S = C^{a} + C - C \cdot m^{a-1}$ whenever $C < m$,

$S = C^{a} + C - C \cdot M^{a-1}$ whenever $C > M$,

$S = C$ otherwise,

with 0.5<a<1.0, where “a” may be used as a (first) tuning parameter.

In this way the local scaling in the present invention is equivalent to the scaling as given in the prior art documents Recommendation P.862 and EP01200945 as long as m≦F≦M. However, for values F<m or F>M the scaling deviates progressively less from 1.0 than the scaling as given in the prior art. The soft-scale factor S is used in the same way F is used in the prior art methods and systems to compensate the output power in each frame locally.
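The chain from local powers to the soft-scale ratio (F, then C, then S) can be illustrated with a minimal Python sketch. The function name `soft_scale_ratio` and every numeric value in the usage note are hypothetical; the text only constrains the parameters by mm<m≦1.0, MM>M≧1.0 and 0.5<a<1.0.

```python
def soft_scale_ratio(px, py, delta, mm, MM, m, M, a):
    """Return (F, C, S) for one frame, following the formulas in the text.

    px, py : local powers of input and output in the frame
    delta  : offset stabilising the ratio for small powers
    mm, MM : clipping levels; m, M : soft-scale knees; a : tuning exponent
    """
    if not (mm < m <= 1.0 <= M < MM):
        raise ValueError("need mm < m <= 1.0 <= M < MM")
    if not (0.5 < a < 1.0):
        raise ValueError("need 0.5 < a < 1.0")

    F = (px + delta) / (py + delta)   # local compensation ratio
    C = min(max(F, mm), MM)           # amplitude clipping at mm and MM

    if C < m:
        S = C ** a + C - C * m ** (a - 1.0)
    elif C > M:
        S = C ** a + C - C * M ** (a - 1.0)
    else:
        S = C                         # identical to the prior art scaling
    return F, C, S
```

With the illustrative values mm=0.1, MM=10, m=0.5, M=2 and a=0.75, a frame with PX=10 and PY=100 (Δ=0) gives C=0.1 and an S of roughly 0.16, i.e. closer to 1.0 than the hard ratio, while a frame with F=1.5 is passed through unchanged.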

In the second soft-scale processing, controlled by a second sub-algorithm, the compensation used is focused on low level parts of the input signal.

When the input signal (reference signal) contains low levels of noise, a transparent speech transport system will give an output speech signal that also contains low levels of noise. The output of the speech transport system is then judged as having lower quality than expected on the basis of the noise introduced by the transport system. One would only be aware of the fact that the noise is not caused by the transport system if one could listen to the input speech signal and make a comparison. However, in most subjective speech quality tests the input reference is not presented to the testing subject, and consequently the subject judges low noise level differences in the input signal as differences in quality of the speech transport system. In order to have high correlations, in objective test systems, with such subjective tests, this effect has to be emulated in an advanced objective speech quality assessment algorithm.

The present preferred option of the invention emulates this by effectively creating a new, virtual, artificial reference speech signal in the power representation domain for which the noise power levels are lowered by a scaling factor that depends on the local level of the noise in the input signal. Thus the newly created artificial reference signal converges to zero faster than the original input signal for low levels of this input signal. When the disturbances in the degraded output signal are calculated during low level signal parts, as present in the reference input signal, the difference calculation in the internal representation loudness domain is carried out after scaling of the input loudness signal to a level that goes to zero faster than the loudness of the input signal as it approaches zero.

According to the prior art method disclosed in EP01200945, the processing implies mapping of the (degraded) output signal (Y(t)) and the reference signal (X(t)) on representation signals LY and LX according to a psycho-physical perception model of the human auditory system. A differential or disturbance signal (D) is determined by “differentiating means” from those representation signals, which disturbance signal is then processed by modeling means in accordance with a cognitive model, in which certain properties of human testees have been modeled, in order to obtain the quality signal Q.

As said above, the difference calculation in the internal representation loudness domain is, within the scope of the present invention, preferably carried out after scaling the input loudness signal to a level that goes to zero faster than the loudness of the input signal as it approaches zero.

An effective implementation of this is achieved by using the difference in internal representation in the time-frequency plane calculated from LX(f)_(n) and LY(f)_(n) (see EP01200945) as

$D(f)_{n} = |LY(f)_{n} - LX(f)_{n}|$

and replacing this by:

$D(f)_{n} = |LY(f)_{n} - H(t,f)|$

with

$H(t,f) = LX(f)_{n}^{b} / K^{b-1}$ for all $LX(f)_{n} < K$

and

$H(t,f) = LX(f)_{n}$ for all $LX(f)_{n} \geq K$.

In these formulas b>1, while K represents the low level noise power criterion per time-frequency cell, dependent on the specific implementation.

This second soft-scale processing sub-algorithm can also be implemented by replacing the LX(f)_(n)<K criterion by a power criterion in a single time frame, i.e.:

$D(f)_{n} = |LY(f)_{n} - H(t,f)|$

with

$H(t,f) = LX(f)_{n}^{b} / K^{b-1}$ for all $LX(t) < K'$

and

$H(t,f) = LX(f)_{n}$ for all $LX(t) \geq K'$.

In these formulas b>1, while K′ represents the low level noise power criterion per time frame, which is dependent on the specific implementation.
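The per-cell variant of the artificial reference can be sketched in Python as follows. The function names and the sample values K=1 and b=2 are illustrative assumptions; the text only requires b>1 and leaves K to the specific implementation.

```python
def artificial_reference(lx, k, b):
    """H(t,f): input loudness pulled towards zero below the criterion K."""
    if lx < 0.0 or k <= 0.0 or b <= 1.0:
        raise ValueError("need lx >= 0, k > 0 and b > 1")
    if lx < k:
        return lx ** b / k ** (b - 1.0)   # goes to zero faster than lx
    return lx                             # unchanged at or above K


def soft_disturbance(ly, lx, k, b):
    """D(f)_n = |LY(f)_n - H(t,f)| for a single time-frequency cell."""
    return abs(ly - artificial_reference(lx, k, b))
```

For example, with K=1 and b=2 a reference loudness of 0.5 is lowered to 0.25, so a transparent system that reproduces the 0.5 noise level yields a disturbance of 0.25 rather than zero. The two branches meet at LX=K, so the mapping is continuous.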

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows schematically a prior art PESQ system, disclosed in ITU-T recommendation P.862.

FIG. 2 shows the same PESQ system which, however, is modified to be fit for executing the method as presented above by the use of a first and, preferably, a second new module.

FIG. 3 shows the first new module of the PESQ system.

FIG. 4 shows the second new module of the PESQ system.

DETAILED DESCRIPTION OF THE DRAWINGS

The PESQ system shown in FIG. 1 compares an original signal (input signal) X(t) with a degraded signal (output signal) Y(t) that is the result of passing X(t) through, e.g., a communication system. The output of the PESQ system is a prediction of the perceived quality that would be given to Y(t) by subjects in a subjective listening test.

In the first step executed by the PESQ system, a series of delays between original input and degraded output is computed, one for each time interval for which the delay is significantly different from the previous time interval. For each of these intervals a corresponding start and stop point is calculated. The alignment algorithm is based on the principle of comparing the confidence of having two delays in a certain time interval with the confidence of having a single delay for that interval. The algorithm can handle delay changes both during silences and during active speech parts.

Based on the set of delays that are found, the PESQ system compares the original (input) signal with the aligned degraded output of the device under test using a perceptual model. The key to this process is transformation of both the original and the degraded signals to internal representations (LX, LY), analogous to the psychophysical representation of audio signals in the human auditory system, taking account of perceptual frequency (Bark) and loudness (Sone). This is achieved in several stages: time alignment, level alignment to a calibrated listening level, time-frequency mapping, frequency warping, and compressive loudness scaling.

The internal representation is processed to take account of effects such as local gain variations and linear filtering that may, if they are not too severe, have little perceptual significance. This is achieved by limiting the amount of compensation and making the compensation lag behind the effect. Thus minor, steady-state differences between corresponding original and degraded speech signals are compensated. More severe effects, or rapid variations, are only partially compensated so that a residual effect remains and contributes to the overall perceptual disturbance. This allows a small number of quality indicators to be used to model all subjective effects. In the PESQ system, two error parameters are computed in the cognitive model; these are combined to give an objective listening quality MOS (Mean Opinion Score). The basic ideas used in the PESQ system are described in the bibliography references [1] to [5].

The Perceptual Model in the Prior Art PESQ System

The perceptual model of a PESQ system, shown in FIG. 1, is used to calculate a distance between the original and degraded speech signal (“PESQ score”). This may be passed through a monotonic function to obtain a prediction of a subjective MOS for a given subjective test. The PESQ score is mapped to a MOS-like scale, a single number in the range of −0.5 to 4.5, although for most cases the output range will be between 1.0 and 4.5, the normal range of MOS values found in an ACR listening quality experiment.

Precomputation of Constant Settings

Certain constant values and functions are pre-computed. For those that depend on the sample frequency, versions for both 8 and 16 kHz sample frequency are stored in the program.

FFT Window Size and Sample Frequency

In the PESQ system the time signals are mapped to the time-frequency domain using a short term FFT (Fast Fourier Transformation) with a Hann window of size 32 ms. For 8 kHz this amounts to 256 samples per window and for 16 kHz the window counts 512 samples, while adjacent frames are overlapped by 50%.
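The window arithmetic above can be checked with a small Python sketch. The helper names `frame_layout` and `frame_starts` are illustrative, and dropping a trailing partial frame is an assumption not stated in the text.

```python
def frame_layout(sample_rate_hz, window_ms=32, overlap_percent=50):
    """Return (window length, hop size) in samples for the short term FFT."""
    window = sample_rate_hz * window_ms // 1000
    hop = window * (100 - overlap_percent) // 100
    return window, hop


def frame_starts(num_samples, window, hop):
    """Start indices of all complete frames (a partial last frame is dropped)."""
    return list(range(0, num_samples - window + 1, hop))
```

At 8 kHz this gives a 256-sample window advanced by 128 samples, and at 16 kHz a 512-sample window advanced by 256 samples, so consecutive frames start 16 ms apart in both cases.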

Absolute Hearing Threshold

The absolute hearing threshold P₀(f) is interpolated to get the values at the center of the Bark bands that are used. These values are stored in an array and are used in Zwicker's loudness formula.

The Power Scaling Factor

There is an arbitrary gain constant following the FFT for time-frequency analysis. This constant is computed from a sine wave of a frequency of 1 000 Hz with an amplitude at 29.54 (40 dB SPL) transformed to the frequency domain using the windowed FFT over 32 ms. The (discrete) frequency axis is then converted to a modified Bark scale by binning of FFT bands. The peak amplitude of the spectrum binned to the Bark frequency scale (called the “pitch power density”) must then be 10 000 (40 dB SPL). The latter is enforced by a postmultiplication with a constant, the power scaling factor S_(p).

The Loudness Scaling Factor

The same 40 dB SPL reference tone is used to calibrate the psychoacoustic (Sone) loudness scale. After binning to the modified Bark scale, the intensity axis is warped to a loudness scale using Zwicker's law, based on the absolute hearing threshold. The integral of the loudness density over the Bark frequency scale, using a calibration tone at 1 000 Hz and 40 dB SPL, must then yield a value of 1 Sone. The latter is enforced by a postmultiplication with a constant, the loudness scaling factor S_(l).

IRS-Receive Filtering

As stated in section 10.1.2 of Draft ITU recommendation P.862 [reference 8], it is assumed that the listening tests were carried out using an IRS receive or a modified IRS receive characteristic in the handset. The necessary filtering to the speech signals is already applied in the pre-processing.

Computation of the Active Speech Time Interval

If the original and degraded speech files start or end with large silent intervals, this could influence the computation of certain average distortion values over the files. Therefore, an estimate is made of the silent parts at the beginning and end of these files. The sum of five successive absolute sample values must exceed 500 from the beginning and end of the original speech file in order for that position to be considered as the start or end of the active interval. The interval between this start and end is defined as the active speech time interval. In order to save computation cycles and/or storage size, some computations can be restricted to the active interval.
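A minimal Python sketch of this search is given below. The function name, the scan order, and the convention of reporting the first sample of the leading run and the exclusive end of the trailing run are assumptions; the text only fixes the run length of five samples and the threshold of 500.

```python
def active_interval(samples, run=5, threshold=500):
    """Return (start, end) of the active interval, end exclusive, or None.

    start is the first position, scanning forward, where `run` successive
    absolute sample values sum to more than `threshold`; end is found the
    same way scanning backward from the end of the file.
    """
    n = len(samples)
    start = None
    for i in range(0, n - run + 1):
        if sum(abs(s) for s in samples[i:i + run]) > threshold:
            start = i
            break
    if start is None:
        return None                      # file is silent throughout

    end = n
    for j in range(n, run - 1, -1):      # j is an exclusive end index
        if sum(abs(s) for s in samples[j - run:j]) > threshold:
            end = j
            break
    return start, end
```

For a file consisting of 10 zero samples, 10 samples of value 200 and 10 further zero samples, the sketch returns (8, 22): the first and last five-sample runs that contain three loud samples.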

Short Term FFT

The human ear performs a time-frequency transformation. In the PESQ system this is implemented by a short term FFT with a window size of 32 ms. The overlap between successive time windows (frames) is 50 percent. The power spectra (the sum of the squared real and squared imaginary parts of the complex FFT components) are stored in separate real valued arrays for the original and degraded signals. Phase information within a single Hann window is discarded in the PESQ system and all calculations are based on only the power representations PX_(WIRSS)(f)_(n) and PY_(WIRSS)(f)_(n). The start points of the windows in the degraded signal are shifted over the delay. The time axis of the original speech signal is left as is. If the delay increases, parts of the degraded signal are omitted from the processing, while for decreases in the delay parts are repeated.

Calculation of the Pitch Power Densities

The Bark scale reflects that at low frequencies, the human hearing system has a finer frequency resolution than at high frequencies. This is implemented by binning FFT bands and summing the corresponding powers of the FFT bands with a normalization of the summed parts. The warping function that maps the frequency scale in Hertz to the pitch scale in Bark does not exactly follow the values given in the literature. The resulting signals are known as the pitch power densities PPX_(WIRSS)(f)_(n) and PPY_(WIRSS)(f)_(n).

Partial Compensation of the Original Pitch Power Density

To deal with filtering in the system under test, the power spectra of the original and degraded pitch power densities are averaged over time. This average is calculated over speech active frames only, using time-frequency cells whose power is more than 1 000 times the absolute hearing threshold. Per modified Bark bin, a partial compensation factor is calculated from the ratio of the degraded spectrum to the original spectrum. The maximum compensation is never more than 20 dB. The original pitch power density PPX_(WIRSS)(f)_(n) of each frame n is then multiplied with this partial compensation factor to equalize the original to the degraded signal. This results in an inversely filtered original pitch power density PPX′_(WIRSS)(f)_(n). This partial compensation is used because severe filtering can be disturbing to the listener. The compensation is carried out on the original signal because the degraded signal is the one that is judged by the subjects in an ACR experiment.

Partial Compensation of the Distorted Pitch Power Density

Short-term gain variations are partially compensated by processing the pitch power densities frame by frame. For the original and the degraded pitch power densities, the sum in each frame n of all values that exceed the absolute hearing threshold is computed. The ratio of the power in the original and the degraded files is calculated and bounded to the range [3·10⁻⁴, 5]. A first order low pass filter (along the time axis) is applied to this ratio. The distorted pitch power density in each frame, n, is then multiplied by this ratio, resulting in the partially gain compensated distorted pitch power density PPY′_(WIRSS)(f)_(n).
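The bounding and smoothing of the frame-wise ratio can be sketched in Python as below. Only the bounds [3·10⁻⁴, 5] come from the text; the filter coefficient `alpha`, the initialisation of the filter with the first bounded ratio, and the handling of a zero degraded power are assumptions made purely for illustration.

```python
def smoothed_gain_ratios(px_frames, py_frames, alpha=0.8, lo=3e-4, hi=5.0):
    """Bounded and low-pass filtered ratio original/degraded power per frame.

    alpha is a hypothetical smoothing coefficient of a first order
    recursive filter: y[n] = alpha * y[n-1] + (1 - alpha) * x[n].
    """
    ratios = []
    state = None
    for px, py in zip(px_frames, py_frames):
        raw = hi if py <= 0.0 else px / py      # assumption: saturate when py is 0
        bounded = min(max(raw, lo), hi)         # bound to [3e-4, 5]
        if state is None:
            state = bounded                     # assumption: start at first value
        else:
            state = alpha * state + (1.0 - alpha) * bounded
        ratios.append(state)
    return ratios
```

Each returned value would multiply the distorted pitch power density of the corresponding frame. Because the filter lags behind sudden changes, rapid gain variations are only partially compensated, which matches the behaviour described above.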

Calculation of the Loudness Densities

After partial compensation for filtering and short-term gain variations, the original and degraded pitch power densities are transformed to a Sone loudness scale using Zwicker's law [7].

$LX(f)_{n} = S_{l} \cdot \left( \frac{P_{0}(f)}{0.5} \right)^{\gamma} \cdot \left[ \left( 0.5 + 0.5 \cdot \frac{PPX'_{WIRSS}(f)_{n}}{P_{0}(f)} \right)^{\gamma} - 1 \right]$

with P₀(f) the absolute threshold and S_(l) the loudness scaling factor.

Above 4 Bark, the Zwicker power, γ, is 0.23, the value given in the literature. Below 4 Bark, the Zwicker power is increased slightly to account for the so-called recruitment effect. The resulting two-dimensional arrays LX(f)_(n) and LY(f)_(n) are called loudness densities.
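As an illustration, Zwicker's law for a single time-frequency cell can be written as the following Python sketch. The function name is hypothetical, and γ is left as a parameter because the text does not give the increased value used below 4 Bark.

```python
def loudness_density(pitch_power, p0, s_l, gamma=0.23):
    """Zwicker's law for one cell, as in the formula for LX(f)_n.

    pitch_power : compensated pitch power density of the cell
    p0          : absolute hearing threshold P0(f) of the band (> 0)
    s_l         : loudness scaling factor
    gamma       : Zwicker power (0.23 above 4 Bark according to the text)
    """
    if p0 <= 0.0:
        raise ValueError("absolute threshold must be positive")
    return s_l * (p0 / 0.5) ** gamma * (
        (0.5 + 0.5 * pitch_power / p0) ** gamma - 1.0
    )
```

A cell whose power equals the absolute threshold maps to a loudness of exactly zero, and the loudness grows compressively (with exponent γ) for powers well above the threshold.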

Calculation of the Disturbance Density

The signed difference between the distorted and original loudness density is computed. When this difference is positive, components such as noise have been added. When this difference is negative, components have been omitted from the original signal. This difference array is called the raw disturbance density.

The minimum of the original and degraded loudness density is computed for each time-frequency cell. These minima are multiplied by 0.25. The corresponding two-dimensional array is called the mask array. The following rules are applied in each time-frequency cell:

-   If the raw disturbance density is positive and larger than the mask value, the mask value is subtracted from the raw disturbance.
-   If the raw disturbance density lies in between plus and minus the magnitude of the mask value, the disturbance density is set to zero.
-   If the raw disturbance density is more negative than minus the mask value, the mask value is added to the raw disturbance density.

The net effect is that the raw disturbance densities are pulled towards zero. This represents a dead zone before an actual time-frequency cell is perceived as distorted. This models the process of small differences being inaudible in the presence of loud signals (masking) in each time-frequency cell. The result is a disturbance density as a function of time (window number n) and frequency, D(f)_(n).
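The three masking rules can be expressed compactly for one time-frequency cell, as in the Python sketch below; the function name is illustrative and the loudness values are assumed to be non-negative.

```python
def disturbance_density(lx, ly, mask_fraction=0.25):
    """Apply the dead-zone rules to one cell of the loudness densities."""
    raw = ly - lx                          # signed raw disturbance density
    mask = mask_fraction * min(lx, ly)     # entry of the mask array
    if raw > mask:
        return raw - mask                  # added components, pulled towards zero
    if raw < -mask:
        return raw + mask                  # omitted components, pulled towards zero
    return 0.0                             # inside the dead zone: inaudible
```

For example, with LX=4 and LY=10 the mask value is 1 and the disturbance becomes 5, whereas LX=4 and LY=4.5 falls inside the dead zone and yields 0.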

Cell-Wise Multiplication with an Asymmetry Factor

The asymmetry effect is caused by the fact that when a codec distorts the input signal it will in general be very difficult to introduce a new time-frequency component that integrates with the input signal, and the resulting output signal will thus be decomposed into two different percepts, the input signal and the distortion, leading to clearly audible distortion [2]. When the codec leaves out a time-frequency component the resulting output signal cannot be decomposed in the same way and the distortion is less objectionable. This effect is modeled by calculating an asymmetrical disturbance density DA(f)_(n) per frame by multiplication of the disturbance density D(f)_(n) with an asymmetry factor. This asymmetry factor equals the ratio of the distorted and original pitch power densities raised to the power of 1.2. If the asymmetry factor is less than 3 it is set to zero. If it exceeds 12 it is clipped at that value. Thus only those time-frequency cells remain, as non-zero values, for which the degraded pitch power density exceeded the original pitch power density.
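A minimal Python sketch of the asymmetry factor for one cell follows. The function name is hypothetical, and requiring a strictly positive original pitch power density is an assumption made to keep the ratio defined; the text does not say how a zero denominator is handled.

```python
def asymmetry_factor(ppy, ppx, exponent=1.2, floor=3.0, ceiling=12.0):
    """Ratio of distorted to original pitch power density, raised to 1.2,
    zeroed below 3 and clipped at 12."""
    if ppx <= 0.0 or ppy < 0.0:
        raise ValueError("need ppx > 0 and ppy >= 0")
    factor = (ppy / ppx) ** exponent
    if factor < floor:
        return 0.0
    return min(factor, ceiling)
```

The asymmetrical disturbance density of the cell is then the product of this factor and D(f)_(n). Equal powers give a factor of 0, a four-fold power increase gives roughly 5.3, and a ten-fold increase is clipped at 12.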

Aggregation of the Disturbance Densities

The disturbance density D(f)_(n) and asymmetrical disturbance density DA(f)_(n) are integrated (summed) along the frequency axis using two different Lp norms and a weighting on soft frames (having low loudness):

$D_{n} = M_{n} \sqrt[3]{\sum_{f = 1, \ldots, \text{Number of Bark bands}} \left( D(f)_{n} W_{f} \right)^{3}}$

$DA_{n} = M_{n} \sum_{f = 1, \ldots, \text{Number of Bark bands}} \left( DA(f)_{n} W_{f} \right)$

with M_(n) a multiplication factor, 1/(power of original frame plus a constant)^(0.04), resulting in an emphasis of the disturbances that occur during silences in the original speech fragment, and W_(f) a series of constants proportional to the width of the modified Bark bins. After this multiplication the frame disturbance values are limited to a maximum of 45. These aggregated values, D_(n) and DA_(n), are called frame disturbances.
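The frequency aggregation can be sketched in Python as below. Two points are assumptions rather than statements of the text: the constant added to the frame power is passed in as the hypothetical parameter `power_offset`, and the cell values are taken as magnitudes (via `abs`) so that the cube root is well defined for signed disturbance densities.

```python
def frame_disturbances(d_cells, da_cells, widths, frame_power,
                       power_offset, cap=45.0):
    """Aggregate one frame over the Bark bands; returns (D_n, DA_n)."""
    # M_n emphasises disturbances in soft (low power) frames.
    m_n = 1.0 / (frame_power + power_offset) ** 0.04
    # L3 norm for the disturbance density (magnitudes assumed).
    d_n = m_n * sum((abs(d) * w) ** 3
                    for d, w in zip(d_cells, widths)) ** (1.0 / 3.0)
    # L1 norm for the asymmetrical disturbance density.
    da_n = m_n * sum(abs(da) * w for da, w in zip(da_cells, widths))
    # Frame disturbances are limited to a maximum of 45.
    return min(d_n, cap), min(da_n, cap)
```

The L3 norm weights the largest cells more strongly than the plain sum used for the asymmetrical disturbance, and the cap prevents a single extreme frame from dominating later aggregation steps.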

Zeroing of the Frame Disturbance

If the distorted signal contains a decrease in the delay larger than 16 ms (half a window), the repeat strategy as mentioned in 10.2.4 of Draft ITU recommendation P.862 [reference 8] is modified. It was found to be better to ignore the frame disturbances during such events in the computation of the objective speech quality. As a consequence frame disturbances are zeroed when this occurs. The resulting frame disturbances are called D′_(n) and DA′_(n).

Realignment of Bad Intervals

Consecutive frames with a frame disturbance above a threshold are called bad intervals. In a minority of cases the objective measure predicts large distortions over a minimum number of bad frames due to incorrect time delays observed by the preprocessing. For those so-called bad intervals a new delay value is estimated by maximizing the cross correlation between the absolute original signal and absolute degraded signal adjusted according to the delays observed by the preprocessing. When the maximal cross correlation is below a threshold, it is concluded that the interval is matching noise against noise and the interval is no longer called bad, and the processing for that interval is halted. Otherwise, the frame disturbance for the frames during the bad intervals is recomputed and, if it is smaller, it replaces the original frame disturbance. The result is the final frame disturbances D″_(n) and DA″_(n) that are used to calculate the perceived quality.

Aggregation of the Disturbance within Split Second Intervals

Next, the frame disturbance values and the asymmetrical frame disturbance values are aggregated over split second intervals of 20 frames (accounting for the overlap of frames: approx. 320 ms) using L₆ norms, a higher p value than in the aggregation over the speech file length. These intervals also overlap 50 percent and no window function is used.

Aggregation of the Disturbance Over the Duration of the Signal

The split second disturbance values and the asymmetrical split second disturbance values are aggregated over the active interval of the speech files (the corresponding frames), now using L₂ norms. The higher value of p for the aggregation within split second intervals as compared to the lower p value of the aggregation over the speech file is due to the fact that when parts of the split seconds are distorted that split second loses meaning, whereas if a first sentence in a speech file is distorted the quality of other sentences remains intact.

Computation of the PESQ Score

The final PESQ score is a linear combination of the average disturbance value and the average asymmetrical disturbance value. The range of the PESQ score is −0.5 to 4.5, although for most cases the output range will be a listening quality MOS-like score between 1.0 and 4.5, the normal range of MOS values found in an ACR (Absolute Category Rating) experiment.

FIG. 2 is equal to FIG. 1, with the exception of a first new module, replacing the prior art module for calculating the local scaling factor, and a new second module, replacing the prior art module for perceptual subtraction.

The first new module is fit for execution of the method according to the invention, comprising means for scaling the output signal and/or the input signal of the system under test, under control of a new, “soft-scaling” algorithm, compensating small deviations of the power, while compensating larger deviations partially, dependent on the power ratio. The first module is depicted in FIG. 3.

The second new module is fit for execution of a further elaboration of the invention, comprising means for the creation of an artificial reference speech signal, for which the noise levels as present in the original input speech signal are lowered by a scaling factor that depends on the local level of the noise in this input.

The operation of both new modules is depicted in the form of flow diagrams. Both modules may be implemented in hardware or in software.

FIG. 3 depicts the operation of the first new module shown in FIG. 2. The operation of the module in FIG. 3 is controlled by the first sub-algorithm as represented by the depicted flow diagram, improving the compensation function to correct for local gain changes in the output signal, by scaling the output and/or input signals in such a way that small deviations of the power are compensated, preferably per time frame or period, while larger deviations are compensated partially, dependent on the power ratio. The preferred simple and effective implementation of the invention takes the local powers, i.e., the power in each frame (of, e.g., 30 ms), and calculates a local compensation ratio F=(PX+Δ)/(PY+Δ).

Note: PX and PY are the shorter notations of PPX_(WIRSS)(f)_(n) and PPY_(WIRSS)(f)_(n) respectively as used in FIGS. 1, 2 and 3.

F is amplitude clipped at levels mm and MM to get a clipped ratio

C=mm for F<mm≦1.0, or C=MM for F>MM≧1.0, or C=F otherwise (“Δ” serves to optimize C for small values of PX and/or PY).

The clipped ratio C is used to calculate a soft-scale ratio S by using factors m and M, with mm<m≦1.0 and MM>M≧1.0.

Soft-scale ratio S=C^(a)+C−C·m^(a−1) for C<m (0.5<a<1.0), or

S=C^(a)+C−C·M^(a−1) for C>M, or S=C otherwise.

In this way the local scaling in the present invention is equivalent to the scaling as given in the prior art documents Recommendation P.862 and EP01200945 as long as m≦F≦M. However, for values F<m or F>M the scaling deviates progressively less from 1.0 than the scaling as given in the prior art. The soft-scale factor S is used in the same way F is used in the prior art methods and systems to compensate the output power in each frame locally.

In the second soft-scale processing, controlled by a second sub-algorithm, advanced scaling is applied on low level parts of the input signal. When the input signal (reference signal) contains low levels of noise, a transparent speech transport system will give an output speech signal that also contains low levels of noise. The output of the speech transport system is then judged as having lower quality than expected on the basis of the noise introduced by the transport system. One would only be aware of the fact that the noise is not caused by the transport system if one could listen to the input speech signal and make a comparison. However, in most subjective speech quality tests the input reference is not presented to the testing subject, and consequently the subject judges low noise level differences in the input signal as differences in quality of the speech transport system. In order to have high correlations, in objective test systems, with such subjective tests, this effect has to be emulated in an advanced objective speech quality assessment algorithm. The embodiment of the preferred option of the invention, illustrated in FIG. 4, emulates this by creating an artificial reference speech signal in the power representation domain for which the noise power levels are lowered by a scaling factor that depends on the local level of the noise in the input signal. Thus the artificial reference signal converges to zero faster than the original input signal for low levels of this input signal. When the disturbances in the degraded output signal are calculated during low level signal parts, as present in the reference input signal, the difference calculation in the internal representation loudness domain is carried out after scaling of the input loudness signal to a level that goes to zero faster than the loudness of the input signal as it approaches zero.

The difference in internal representation in the time-frequency plane isset to D(f)_(n)=|LY(f)_(n)−LX(f)_(n) ^(b)/K^(b−1)| for LX(f)_(n)<K or

D(f)_(n)=|LY(f)_(n)−LX(f)_(n)| for LX(f)_(n)≧K.

In these formula is b>1 while K represents the low level noise powercriterion per time frequency cell.

As an alternative, the second soft-scale processing sub-algorithm can also be implemented by replacing the $LX(f)_{n} < K$ criterion by a power criterion in a single time frame. In this alternative option the difference in internal representation in the time-frequency plane is set to

$D(f)_{n} = \left| LY(f)_{n} - LX(f)_{n}^{b}/K^{b-1} \right|$ for $LX(t) < K'$, or

$D(f)_{n} = \left| LY(f)_{n} - LX(f)_{n} \right|$ for $LX(t) \geq K'$.

In these alternative formulas, b>1 and K′ represents the low-level noise power criterion per time frame.
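The per-frame alternative can be sketched in the same way. In this hedged Python illustration the function name `frame_differences` is hypothetical, and the frame-level value LX(t) is passed in as a parameter because its computation is not specified in this passage. Note that, as in the formulas above, K′ only selects the branch, while the scaling itself still uses K.

```python
def frame_differences(ly_cells, lx_cells, lx_frame, k, k_frame, b):
    """Differences D for all frequency cells of one time frame.

    ly_cells, lx_cells : per-cell loudness of output and input signal
                         (assumed nonnegative; an assumption of this sketch)
    lx_frame           : frame-level input value LX(t), supplied by the caller
    k                  : criterion K used inside the scaling term
    k_frame            : per-frame criterion K' that selects the branch
    b                  : exponent, must be greater than 1
    """
    if len(ly_cells) != len(lx_cells):
        raise ValueError("ly_cells and lx_cells must have equal length")
    if b <= 1:
        raise ValueError("b must be greater than 1")
    if k <= 0 or k_frame <= 0:
        raise ValueError("K and K' must be positive")
    if any(v < 0 for v in ly_cells) or any(v < 0 for v in lx_cells):
        raise ValueError("loudness values are assumed nonnegative")

    # One decision per frame, applied to every cell in that frame.
    low_level_frame = lx_frame < k_frame

    differences = []
    for ly, lx in zip(ly_cells, lx_cells):
        if low_level_frame:
            reference = lx ** b / k ** (b - 1)
        else:
            reference = lx
        differences.append(abs(ly - reference))
    return differences
```

Because the decision is taken once per frame, a cell whose loudness exceeds K is still scaled when the frame as a whole falls below K′, which is the main behavioral difference from the per-cell variant.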

REFERENCES INCORPORATED HEREIN BY REFERENCE

- [1] BEERENDS (J. G.), STEMERDINK (J. A.): A Perceptual Speech-Quality Measure Based on a Psychoacoustic Sound Representation, J. Audio Eng. Soc., Vol. 42, No. 3, pp. 115-123, March 1994.
- [2] BEERENDS (J. G.): Modelling Cognitive Effects that Play a Role in the Perception of Speech Quality, Speech Quality Assessment, Workshop papers, Bochum, pp. 1-9, November 1994.
- [3] BEERENDS (J. G.): Measuring the quality of speech and music codecs, an integrated psychoacoustic approach, 98th AES Convention, pre-print No. 3945, 1995.
- [4] HOLLIER (M. P.), HAWKSFORD (M. O.), GUARD (D. R.): Error activity and error entropy as a measure of psychoacoustic significance in the perceptual domain, IEE Proceedings - Vision, Image and Signal Processing, 141 (3), 203-208, June 1994.
- [5] RIX (A. W.), REYNOLDS (R.), HOLLIER (M. P.): Perceptual measurement of end-to-end speech quality over audio and packet-based networks, 106th AES Convention, pre-print No. 4873, May 1999.
- [6] HOLLIER (M. P.), HAWKSFORD (M. O.), GUARD (D. R.): Characterisation of communications systems using a speech-like test stimulus, Journal of the AES, 41 (12), 1008-1021, December 1993.
- [7] ZWICKER (Feldtkeller): Das Ohr als Nachrichtenempfänger, S. Hirzel Verlag, Stuttgart, 1967.
- [8] Draft ITU-T recommendation P.862, “Telephone transmission quality, telephone installations, local line networks - Methods for objective and subjective assessment of quality - Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs”, ITU-T 02.2001.
- [9] European patent application EP01200945, Koninklijke KPN n.v.

1. A method for use in a system that measures, through use of a psychoacoustic model of human perception, transmission quality of an output speech signal (Y) produced by an audio system, the audio system having an input speech signal (X) applied thereto and responsively producing the output speech signal, the output speech signal being a degraded version of the input speech signal, both the input speech signal and the output speech signal being applied as input to the measurement system and a quality signal being produced as output therefrom, the method comprising the steps, performed in the measurement system, of:
(a) determining both a local compensation ratio (F) indicative of a ratio of power of the input speech signal (X) to power of the output speech signal (Y) and, in response to the local compensation ratio, a variable scale factor (S), wherein the determining step comprises the steps of:
(a1) calculating the local compensation ratio (F) from power representations PX and PY of the time-frequency representations of the input speech signal (X) and the output signal (Y) respectively, and where F equals a ratio PX/PY;
(a2) calculating a clipped ratio C where C is set equal to a first pre-defined clipping value mm for F<mm, a second pre-defined clipping value MM for F>MM, or, for all other values, F; and
(a3) calculating the scaling ratio (S) from a first scaling factor m and a second scaling factor M, where both m and M are pre-defined values with mm<m≤1 and MM>M≥1, S equals either C^(a)+C−C(m)^(a−1) for C<m, or C^(a)+C−C(M)^(a−1) for either C>M or S=C, and ‘a’ is a first tuning parameter with a predefined value between zero and one;
(b) generating, in response to the scale factor and predefined time-frequency representations, in accordance with the model, of the input speech signal and the output speech signal, first and second signals such that relatively small deviations in power between the input speech signal and the output speech signal are compensated in the first and second signals while relatively large deviations in the power between the input speech signal and the output speech signal are only partially compensated in the first and second signals, wherein the generating step comprises one of the steps of:
(b1) scaling, in response to the scale factor (S), the representations of both the input speech signal (X) and the output signal (Y) to yield a compensated input speech signal representation and a compensated output signal representation as the first and second signals, respectively; or
(b2) scaling, in response to the scale factor (S), the representation of the input speech signal (X) to yield a compensated input speech signal representation such that the first signal is the compensated input speech signal representation and the second signal is the output signal representation; or
(b3) scaling, in response to the scale factor (S), the representation of the output signal (Y) to yield a compensated output signal representation such that the second signal is the compensated output signal representation and the first signal is the input speech signal representation;
(c) comparing the first and second signals to yield a difference therebetween;
(d) ascertaining, in response to the difference, the transmission quality; and
(e) producing, in response to the transmission quality, the quality signal.
2. The method recited in claim 1 further comprising the step of creating an artificial reference speech signal for which noise levels present in the input speech signal (X) are reduced by a scaling factor which depends on local level of the noise in the input speech signal.
3. The method recited in claim 2 wherein the comparing step comprises the step of: setting a difference $D(f)_{n}$ in loudness representations $LX(f)_{n}$ and $LY(f)_{n}$ of the input speech signal (X) and the output signal (Y), respectively, in a time-frequency plane equal to $\left| LY(f)_{n} - LX(f)_{n}^{b}/K^{b-1} \right|$ for $LX(f)_{n} < K$, or $\left| LY(f)_{n} - LX(f)_{n} \right|$ for $LX(f)_{n} \geq K$, where b is a second tuning parameter with a predefined value greater than one and K is a low level noise power criterion value representing a desired low-level noise power criterion per time-frequency cell, where $LX(f)_{n}$ and $LY(f)_{n}$ are calculated according to the following equations:

$\begin{matrix}{{{LX}(f)}_{n} = {{S\left( \frac{P_{0}(f)}{0.5} \right)}^{\gamma}\left\lbrack {\left( {0.5 + {0.5\frac{{{PX}(f)}_{n}}{P_{0}(f)}}} \right)^{\gamma} - 1} \right\rbrack}} \\{{{LY}(f)}_{n} = {{S\left( \frac{P_{0}(f)}{0.5} \right)}^{\gamma}\left\lbrack {\left( {0.5 + {0.5\frac{{{PY}(f)}_{n}}{P_{0}(f)}}} \right)^{\gamma} - 1} \right\rbrack}}\end{matrix}$

where: P₀(f) is an absolute threshold; S is the scale factor; and γ is 0.23 for loudness above 4 Bark and, for loudness less than 4 Bark, is a predefined value higher than 0.23.
4. The method recited in claim 2 wherein the comparing step comprises the step of: setting a difference $D(f)_{n}$ in loudness representations $LX(f)_{n}$ and $LY(f)_{n}$ of the input speech signal (X) and the output signal (Y), respectively, in a time-frequency plane equal to $\left| LY(f)_{n} - LX(f)_{n}^{b}/K^{b-1} \right|$ for $LX(t) < K'$, or $\left| LY(f)_{n} - LX(f)_{n} \right|$ for $LX(t) \geq K'$, where b is a second tuning parameter with a predefined value greater than one and K′ is a low level noise power criterion value representing a desired low-level noise power criterion per time frame, where $LX(f)_{n}$ and $LY(f)_{n}$ are calculated according to the following equations:

$\begin{matrix}{{{LX}(f)}_{n} = {{S\left( \frac{P_{0}(f)}{0.5} \right)}^{\gamma}\left\lbrack {\left( {0.5 + {0.5\frac{{{PX}(f)}_{n}}{P_{0}(f)}}} \right)^{\gamma} - 1} \right\rbrack}} \\{{{LY}(f)}_{n} = {{S\left( \frac{P_{0}(f)}{0.5} \right)}^{\gamma}\left\lbrack {\left( {0.5 + {0.5\frac{{{PY}(f)}_{n}}{P_{0}(f)}}} \right)^{\gamma} - 1} \right\rbrack}}\end{matrix}$

where: P₀(f) is an absolute threshold; S is the scale factor; and γ is 0.23 for loudness above 4 Bark and, for loudness less than 4 Bark, is a predefined value higher than 0.23.
5. Apparatus for measuring, through use of a psychoacoustic model of human perception, transmission quality of an output speech signal (Y) produced by an audio system, the audio system having an input speech signal (X) applied thereto and responsively producing the output speech signal, the output speech signal being a degraded version of the input speech signal, both the input speech signal and the output speech signal being applied as input to the measurement system and a quality signal being produced as output therefrom, the apparatus comprising:
(a) means for determining both a local compensation ratio (F) indicative of a ratio of power of the input speech signal (X) to power of the output speech signal (Y) and, in response to the local compensation ratio, a variable scale factor (S), wherein the determining means comprises:
(a1) means for calculating the local compensation ratio (F) from power representations PX and PY of the time-frequency representations of the input speech signal (X) and the output signal (Y), respectively, and where F equals a ratio PX/PY;
(a2) means for calculating a clipped ratio C where C is set equal to a first pre-defined clipping value mm for F<mm, a second pre-defined clipping value MM for F>MM, or, for all other values, F; and
(a3) means for calculating the scaling ratio (S) from a first scaling factor m and a second scaling factor M, where both m and M are pre-defined values with mm<m≤1 and MM>M≥1, S equals either C^(a)+C−C(m)^(a−1) for C<m, or C^(a)+C−C(M)^(a−1) for either C>M or S=C, and ‘a’ is a first tuning parameter with a predefined value between zero and one;
(b) means for generating, in response to the scale factor and predefined time-frequency representations, in accordance with the model, of the input speech signal and the output speech signal, first and second signals such that relatively small deviations in power between the input speech signal and the output speech signal are compensated in the first and second signals while relatively large deviations in the power between the input speech signal and the output speech signal are only partially compensated in the first and second signals, wherein the generating means comprises:
(b1) means for scaling, in response to the scale factor (S), the representations of both the input speech signal (X) and the output signal (Y) to yield a compensated input speech signal representation and a compensated output signal representation as the first and second signals, respectively; or
(b2) means for scaling, in response to the scale factor (S), the representation of the input speech signal (X) to yield a compensated input speech signal representation such that the first signal is the compensated input speech signal representation and the second signal is the output signal representation; or
(b3) means for scaling, in response to the scale factor (S), the representation of the output signal (Y) to yield a compensated output signal representation such that the second signal is the compensated output signal representation and the first signal is the input speech signal representation;
(c) means for comparing the first and second signals to yield a difference therebetween; and
(d) means for ascertaining, in response to the difference, the transmission quality and for producing, in response to the transmission quality, the quality signal.
6. The apparatus recited in claim 5 further comprising means for creating an artificial reference speech signal for which noise levels present in the input speech signal (X) are reduced by a scaling factor which depends on local level of the noise in the input speech signal.
7. The apparatus recited in claim 6 wherein the comparing means comprises: means for setting a difference $D(f)_{n}$ in loudness representations $LX(f)_{n}$ and $LY(f)_{n}$ of the input speech signal (X) and the output signal (Y), respectively, in a time-frequency plane equal to $\left| LY(f)_{n} - LX(f)_{n}^{b}/K^{b-1} \right|$ for $LX(f)_{n} < K$, or $\left| LY(f)_{n} - LX(f)_{n} \right|$ for $LX(f)_{n} \geq K$, where b is a second tuning parameter with a predefined value greater than one and K is a low level noise power criterion value representing a desired low-level noise power criterion per time-frequency cell, where $LX(f)_{n}$ and $LY(f)_{n}$ are calculated according to the following equations:

$\begin{matrix}{{{LX}(f)}_{n} = {{S\left( \frac{P_{0}(f)}{0.5} \right)}^{\gamma}\left\lbrack {\left( {0.5 + {0.5\frac{{{PX}(f)}_{n}}{P_{0}(f)}}} \right)^{\gamma} - 1} \right\rbrack}} \\{{{LY}(f)}_{n} = {{S\left( \frac{P_{0}(f)}{0.5} \right)}^{\gamma}\left\lbrack {\left( {0.5 + {0.5\frac{{{PY}(f)}_{n}}{P_{0}(f)}}} \right)^{\gamma} - 1} \right\rbrack}}\end{matrix}$

where: P₀(f) is an absolute threshold; S is the scale factor; and γ is 0.23 for loudness above 4 Bark and, for loudness less than 4 Bark, is a predefined value higher than 0.23.
8. The apparatus recited in claim 6 wherein the comparing means comprises: means for setting a difference $D(f)_{n}$ in loudness representations $LX(f)_{n}$ and $LY(f)_{n}$ of the input speech signal (X) and the output signal (Y), respectively, in a time-frequency plane equal to $\left| LY(f)_{n} - LX(f)_{n}^{b}/K^{b-1} \right|$ for $LX(t) < K'$, or $\left| LY(f)_{n} - LX(f)_{n} \right|$ for $LX(t) \geq K'$, where b is a second tuning parameter with a predefined value greater than one and K′ is a low level noise power criterion value representing a desired low-level noise power criterion per time frame, where $LX(f)_{n}$ and $LY(f)_{n}$ are calculated according to the following equations:

$\begin{matrix}{{{LX}(f)}_{n} = {{S\left( \frac{P_{0}(f)}{0.5} \right)}^{\gamma}\left\lbrack {\left( {0.5 + {0.5\frac{{{PX}(f)}_{n}}{P_{0}(f)}}} \right)^{\gamma} - 1} \right\rbrack}} \\{{{LY}(f)}_{n} = {{S\left( \frac{P_{0}(f)}{0.5} \right)}^{\gamma}\left\lbrack {\left( {0.5 + {0.5\frac{{{PY}(f)}_{n}}{P_{0}(f)}}} \right)^{\gamma} - 1} \right\rbrack}}\end{matrix}$

where: P₀(f) is an absolute threshold; S is the scale factor; and γ is 0.23 for loudness above 4 Bark and, for loudness less than 4 Bark, is a predefined value higher than 0.23.